Skip to content

fix(memory): ingestion reliability and merge bloat#604

Open
slvnlrt wants to merge 6 commits into
spacedriveapp:mainfrom
slvnlrt:pr/ingestion-fixes
Open

fix(memory): ingestion reliability and merge bloat#604
slvnlrt wants to merge 6 commits into
spacedriveapp:mainfrom
slvnlrt:pr/ingestion-fixes

Conversation

@slvnlrt

@slvnlrt slvnlrt commented Jun 24, 2026

Copy link
Copy Markdown

Summary

Four related fixes to the memory ingestion pipeline and the maintenance merge. They were found during a real ingestion run that produced a runaway loop of duplicate memories. All are backend-agnostic; the only schema change is an additive migration on ingestion_files.

Fixes

1. Deterministic chunk completionsrc/agent/ingestion.rs
The per-chunk ingestion agent was required to call memory_persistence_complete. If it didn't (e.g. a smaller model skipping the signal), the chunk was marked failed even though memory_save had already committed. Chunk completion is now decided by the agent run returning Ok; the persistence-contract state is kept only for observability. This removes the false failure that fed the retry loop below.

2. Bounded retries with backoff + quarantinesrc/agent/ingestion.rs (+ migration)
A failed ingestion file was re-processed on every poll cycle indefinitely (no cap, no backoff), continuously re-creating memories. Adds attempts + next_attempt_at to ingestion_files: failures back off exponentially and are quarantined after a maximum attempt count, so the poll loop stops re-processing them.

3. Ingest delete removes the source filesrc/api/ingest.rs
Deleting an ingest file only removed the ingestion_files row, leaving the file on disk; the poll loop then re-discovered it and re-created the row ("reappears"). Delete now removes the disk file and purges both ingestion_files and ingestion_progress.

4. Canonical merge instead of concatenationsrc/memory/maintenance.rs
Maintenance merged near-duplicate memories by concatenating their content (winner\n\nloser), bloating a single memory with many reworded copies of the same fact. It now keeps the importance-winner's content as the single canonical version.

Migration

migrations/20260623000001_ingestion_retry_budget.sql — additive only (ADD COLUMN attempts, ADD COLUMN next_attempt_at). ingestion_files.status has no CHECK constraint, so the new terminal quarantined value needs no schema change.

Tests

  • New unit tests: retry/backoff + quarantine decision, the merge canonical-content behavior, and the delete purge (asserts the disk file is removed and both tracking tables are emptied).
  • Existing maintenance integration tests updated to assert the canonical (non-concatenated) survivor content.
  • cargo test --lib — all green.

Note

Four fixes addressing a runaway ingestion loop: deterministic chunk completion tracking, bounded retries with exponential backoff and quarantine, deletion of source files on ingest delete, and canonical-content merging instead of concatenation.

Written by Tembo for commit d6239885. This will update automatically on new commits.

slvnlrt added 4 commits June 24, 2026 21:45
A chunk was marked as failed when the per-chunk ingestion agent did not emit the memory_persistence_complete signal, even though memory_save had already committed. Combined with the retry loop this re-saved the same memories indefinitely. Chunk completion is now decided by the agent run returning Ok (the contract state is kept for observability only).
A failed ingestion file was re-processed on every poll cycle forever (no cap, no backoff). Add attempts + next_attempt_at columns to ingestion_files: failed files back off exponentially and are quarantined after a max attempt count, so the poll loop stops re-processing them.
Deleting an ingest file only removed the ingestion_files row, leaving the file on disk so the poll loop re-discovered it and re-created the row. Delete now removes the disk file and purges both ingestion_files and ingestion_progress.
Maintenance merged near-duplicate memories by concatenating their content, bloating a memory with many reworded copies of the same fact. Keep the importance-winner's content as the single canonical version.
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1312bab9-c2a4-449a-9628-a73e767ececa

📥 Commits

Reviewing files that changed from the base of the PR and between e4a2eec and fd987e7.

📒 Files selected for processing (3)
  • src/agent/ingestion.rs
  • src/api/ingest.rs
  • src/memory/maintenance.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/api/ingest.rs
  • src/memory/maintenance.rs
  • src/agent/ingestion.rs

Walkthrough

Adds ingestion retry-budget tracking with quarantine/backoff handling, updates ingest-file deletion to purge disk and database rows, and changes memory merge output to keep the winner text while adjusting tests for the new behavior.

Changes

Ingestion retry flow

Layer / File(s) Summary
Retry budget contract and state
migrations/20260623000001_ingestion_retry_budget.sql, src/agent/ingestion.rs
Adds retry-budget columns, retry decision helpers, and retry-state loading from ingestion_files.
Retry gate and failure handling
src/agent/ingestion.rs
Skips quarantined or not-yet-due files, then increments attempts and either quarantines or reschedules the file after failure, or completes and cleans up on success.
Chunk completion and tests
src/agent/ingestion.rs
Changes chunk success to follow the LLM result, logs saved-memory counts, and updates tests for backoff, quarantine, and chunk completion behavior.

Ingest file purge

Layer / File(s) Summary
Purge helper, endpoint, and test
src/api/ingest.rs
Adds disk-and-database purge logic for ingest files, wires the delete endpoint to it, and verifies the helper with a SQLite-backed test.

Memory merge content

Layer / File(s) Summary
Canonical merged content
src/memory/maintenance.rs
Changes merged_memory_content to return the winner text, updates existing merge assertions, and adds tests for non-concatenating near-duplicate merges.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Suggested reviewers

  • jamiepine
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the main changes: ingestion reliability fixes and maintenance merge deduplication.
Description check ✅ Passed The description accurately summarizes the ingestion and merge changes in the pull request.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/api/ingest.rs (1)

241-245: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Avoid abbreviated binding name e.

Use error to match the project convention (and the spelled-out error already used in delete_ingest_file at Line 286).

♻️ Proposed rename
         match tokio::fs::remove_file(&path).await {
             Ok(()) => {}
-            Err(e) if e.kind() == std::io::ErrorKind::NotFound => {}
-            Err(e) => return Err(e.into()),
+            Err(error) if error.kind() == std::io::ErrorKind::NotFound => {}
+            Err(error) => return Err(error.into()),
         }

As per coding guidelines: "Don't abbreviate variable names. Use queue not q, message not msg, channel not ch."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/ingest.rs` around lines 241 - 245, Rename the abbreviated error
binding in the `remove_file` match inside `delete_ingest_file` to `error` to
match the project’s naming convention and the existing usage later in the same
function. Update both the `NotFound` guard and the fallback `Err(...)` arm to
use the spelled-out `error` name consistently.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/agent/ingestion.rs`:
- Around line 355-360: The completion flow in ingestion handling deletes
progress before the source file removal can fail, which can leave the source
intact but chunk markers cleared. In the ingestion success path around
complete_ingestion_file, delete_progress, and tokio::fs::remove_file, move the
source-file deletion to happen before clearing ingestion_progress, and only call
delete_progress after remove_file succeeds so a failed delete cannot trigger
chunk reprocessing.

In `@src/memory/maintenance.rs`:
- Around line 318-320: The fallback path in the merge logic returns
loser_trimmed directly when winner_trimmed is empty, which bypasses
MAX_MERGED_MEMORY_CONTENT_BYTES enforcement. Update the trimming flow in the
relevant merge function in maintenance.rs so the loser fallback also goes
through the same max-size truncation used for normal merged content, preserving
the byte limit regardless of which side wins.

---

Nitpick comments:
In `@src/api/ingest.rs`:
- Around line 241-245: Rename the abbreviated error binding in the `remove_file`
match inside `delete_ingest_file` to `error` to match the project’s naming
convention and the existing usage later in the same function. Update both the
`NotFound` guard and the fallback `Err(...)` arm to use the spelled-out `error`
name consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b014c1b-e1df-4aa2-9ddd-cf84d4556a76

📥 Commits

Reviewing files that changed from the base of the PR and between ac52277 and d623988.

📒 Files selected for processing (4)
  • migrations/20260623000001_ingestion_retry_budget.sql
  • src/agent/ingestion.rs
  • src/api/ingest.rs
  • src/memory/maintenance.rs

Comment thread src/agent/ingestion.rs Outdated
Comment thread src/memory/maintenance.rs Outdated
Comment thread src/memory/maintenance.rs Outdated
if loser_trimmed.is_empty() {
return winner_trimmed.to_string();
if winner_trimmed.is_empty() {
return loser_trimmed.to_string();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edge-case: if winner_trimmed is empty, this early-return skips the MAX_MERGED_MEMORY_CONTENT_BYTES truncation. Might be worth applying the same truncation logic here too.

Suggested change
return loser_trimmed.to_string();
if winner_trimmed.is_empty() {
if loser_trimmed.len() <= MAX_MERGED_MEMORY_CONTENT_BYTES {
return loser_trimmed.to_string();
}
let boundary = loser_trimmed.floor_char_boundary(MAX_MERGED_MEMORY_CONTENT_BYTES);
return loser_trimmed[..boundary].to_string();
}

Comment thread src/api/ingest.rs
.await?
{
let path = ingest_dir.join(&filename);
match tokio::fs::remove_file(&path).await {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this deletes a disk path derived from ingestion_files.filename, consider rejecting any non-normal path components (e.g. .. / absolute paths) before join as defense-in-depth.

Suggested change
match tokio::fs::remove_file(&path).await {
let filename_path = Path::new(&filename);
if filename_path
.components()
.any(|c| !matches!(c, std::path::Component::Normal(_)))
{
anyhow::bail!("invalid ingest filename: {filename}");
}
let path = ingest_dir.join(filename_path);

- ingestion: remove the source file before clearing chunk progress, so a failed remove_file cannot drop progress markers and cause already-ingested chunks to be re-processed (and re-create memories).

- memory: apply the MAX_MERGED_MEMORY_CONTENT_BYTES cap on the empty-winner fallback path in merged_memory_content.

- api: reject non-normal path components (`..`, absolute) before deleting an ingest file (defense-in-depth on a destructive op); rename error binding for consistency.
slvnlrt added a commit to slvnlrt/spacebot that referenced this pull request Jun 25, 2026
Cross-backend fixes raised by the upstream PR reviews, applied to the integration branch: ingest delete-before-clear-progress, merged_memory_content size cap on the empty-winner path, ingest delete path-component guard, agent_id on the SQLite backend store, deterministic (weight-descending) edge ordering in graph traversal, chunked IN-list queries to stay under SQLite's bind limit, and FTS-index ensure on set_embedding. Feature-off: 891 tests green. Feature-on: clippy clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant